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Abstract 

The Independent Component Analysis is a recently developed technique for component 
extraction. This new method requires the statistical independence of the extracted components, 
a stronger constraint that uses higher-order statistics, instead of the classical decorre lation, a 
weaker constraint that uses only second-order statistics. This technique has been used recently 
for the analysis of geophysical time series with the goal of investigating the causes of variability 
in observed data (i.e. exploratory approach). We demonstrate with a data simulation 
experiment that, if initialized with a Principal Component Analysis, the Independent 
Component Analysis performs a rotation of the classical PCA (or EOF) solution. This rotation 
uses no localization criterion like other Rotation Techniques (RT), only the global generalization 
of decorrelation by statistical independence is used. This rotation of the PCA solution seems to 
be able to solve the tendency of PCA to mix several physical phenomena, even when the signal is 
just their linear sum. 
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1. Introduction 

This work concerns methods for the investigation 
of the physical causes of variability of a dynamical 
system, e.g., the climate, from observations of its be- 
havior. The observed time series of the system’s state 
can be produced by a mixture of different compo- 
nents representing different physical processes. In the 
most general case, the time series, x(j), with tempo- 
ral dimension, N, at a particular spatial coordinate, 
j £ {1, . . . , M}, where M is the spatial dimension, is 
the result of the mixture of these components, cr(j), 
by an operator Q\ 

x (j) = G(er(J))- (1) 

In this paper we use lower (upper) case bold letters 
to indicate vectors (matrices) and we will refer to a 
particular spatial location as a pixel. We consider 
decomposition in time (i.e., the observations, x(/>), 
are time series at each pixel, j), but the following 
discussion would be the same for a decomposition in 
space. 

The goal of the analysis method is to infer the i n- 
known contributing components, cr(j) y from the co- 
served data, x(j). Often in performing such an anal- 
ysis, one does not know a priori much about what Q 
is like, even whether it is linear or not. We introduce 

h = J(x) ~ cr, (2) 

where J is an estimate of the unknown inverse map- 
ping, Q~ l , and h is an estimate of the unknown vector 
cr. Analysis methods that estimate J and h are called 
component extraction techniques. 

If Q is nonlinear and cannot be usefully linearized, 
then we are faced with a component extraction prob- 
lem that is highly complex for many reasons. First, 
the definition of a well-adapted nonlinear component 
extraction model, J , is difficult without some a pri- 
ori information about the nonlinear mixture model, 
Q. Using generic statistical models for the nonlin- 
ear regressions of J often introduces too many de- 
grees of freedom, which ruins the inference process. 
Second, the determination of the extracted compo- 
nents is much more complex since the basis functions, 
(o’O’)) > va rv with location, j . Third, the uniqueness 
of the result and its interpretation in physical-process 
terms is difficult to demonstrate. Some nonlinear 
techniques have been developed ( Monahan 2000) but 
their main application is to represent complex data in 
a more compact way (i.e., compression). Here, we are 
not concerned with this use of such techniques, but 


with their use to extract “meaningful” (i.e. causal) 
components to understand how a dynamical system 
works. Currently, all the “classical” (and most fre- 
quently used) extraction methods are linear; the use 
of nonlinear models in (4) for component extraction 
is only just beginning. The nonlinear case is beyond 
the scope of this paper. 

One approach to simplify the analysis that may 
when Q is complex is to linearize (1) using G(a°(j)), 
the Jacobian matrix of the nonlinear operator Q near 
a particular state, cr°(j): 

Ax(j) = G(cr°(j)) • A cr(j) (3) 

= 9i(o-°(j))-A<Ti(j)+ 92 ((7 0 (j))Acr2(j)+---+g Q (<r 0 (;))-A(TQ(j) 

(4) 

where the temporal basis functions, g x (<x°(j)), . . . ,0 q(ct°(J)), 
which are the columns of the matrix G(cr°(j )), are 
unknown time series describing a fixed dynamical be- 
havior at each pixel, j. Each p J (cr°(;')) is the temporal 
response to a perturbation A a^j) of i th component 
at pixel, j, when the state is given by cr°(j). For ex- 
ample, in the case where the physical component, z, 
is an oscillating wave propagating in space, the time 
series g^cr 0 ^)) have the same shape as the source, 
but with a time delay dependent on the distance be- 
tween their location at pixel, j, and the source of the 
wave. 


When Q is known a priori to be linear or has been 
linearized to G, the equivalent to (l)-(4) when the 
time series, x(j), at particular spatial coordinate, j € 
{1, . . . , M}, is decomposed in time, is: 

x(j) = G-CT(j)=g 1 a 1 U)+g 2 a 2 (j) + --- + g Q a Q (j), 

(5) 

where the temporal basis functions, g l , . . . , which 
are the columns of matrix G, are unknown time series 
describing a fixed dynamical behavior. In contrast 
with the nonlinear case, the basis functions, g x , are 
independent of the geographical location, j. Each g i 
could be a signal with a different physical cause oper- 
ative in a particular geographical region represented 
by a different component map {< 7 i(j) ; j = 1 , . . . M). 
The goal of the analysis is to infer the unknown con- 
tributing components, cr(j), from the observed data, 
x(j). In the linear case, we write (2) as 

h = J • x ^ cr, ( 6 ) 

where J is an estimate of the unknown matrix G~ 1 
(the superscript -1 represents the pseudo-inverse if 
G is not square) and h is an estimate of the unknown 
vector cr. The ability of statistical analysis techniques 
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to retrieve good estimates, /i, of the true components, 
is highly dependent on the quality of the statistical 
dataset used (i.e., sufficiently large number of inde- 
pendent examples is needed to sample all the varia- 
tions involved) and on the technical assumptions that 
are made about J and h. 

There are many techniques that have been devel- 
oped to estimate J and h. The one most frequently 
used by the climate research community today is the 
Principal Components Analysis (PCA) or Empirical 
Orthogonal Function (EOF) method ( Lorentz 1951; 
von Storch and Frankignoul 1998). Sometimes, mod- 
ifications of this method are used that either apply 
some other criterion besides maximizing the variance 
explained by each component or relax the require- 
ment for orthogonal basis functions. These methods 
use this additional information to rotate, orthogonally 
or obliquely, the solution of a previous PCA; we refer 
to this large set of methods as Rotational Techniques 
(RT) (see Horel 1981; Richman 1986). We formulate 
a general linear case and the classical analysis tech- 
niques in Section 2. We will show some of the known 
difficulties of the PCA solution (Karl et al 1982; 
Richman 1986) that have led to the development of 
the rotational techniques. 

The Independent Component Analysis (ICA) is in- 
troduced in Section 3. This method is based on infor- 
mation theory and has been recently developed in the 
context of signal processing studies and of the devel- 
opment of neural coding models ( Jutten and Heraut 
1991; Atick 1992; Bell and Sejnowski 1995). This 
technique has now been studied for some time by the 
statistical analysis research community and many re- 
cent applications of the ICA paradigm can be found 
in the ICA ’99 proceedings ( Cardoso et al. 1999) or 
( Hyvdrinen and Oja 2000), but this method has not 
been used for analysis of climatological observations 
(see, Aires et al 2000) The two major distinctions be- 
tween the ICA approach and the classical techniques 
are: 

• The method extracts statistically independent 
components, even if these components have non- 
Gaussian probability distribution functions, by 
making use of higher-order statistics, whereas 
the PCA or RT approaches use only second- 
order statistics. 

• A linear mixture model is not assumed; any 
extraction model could be used with the ICA 
paradigm ( Burel 1992), which allows the intro- 
duction of pertinent a priori information about 


the mixture model, if it is available. 

We will show that the “linear” (this term will be 
explained in the following) and PCA-initialized form 
of ICA, described here, performs a rotation of the 
PCA solution, eliminating the mixing problem: PCA 
has the tendency to mix several components, even 
when the signal is just their linear sum. We argue 
that, because of certain general features of the ICA 
approach, it is a particularly promising technique for 
rotations of the PCA solution. Furthermore, a previ- 
ous study that applies a similar ICA to the analysis 
of variations of tropical sea surface temperature time 
series ( Aires et al. 2000) illustrates its potential to 
separate a geophysical time series into more mean- 
ingful components. 

To illustrate most clearly how ICA avoids the com- 
ponent mixing problem, we construct a synthetic 
dataset, where the true answer to the decomposition 
problem is known, and apply PCA-initialized Inde- 
pendent Component Analysis to extract components 
(Section 4). We have deliberately devised a dataset 
with certain characteristics to challenge whether ICA 
can separate distinct modes of variability. In partic- 
ular, the synthetic dataset is formed by a linear sum 
of components, some of which are well-separated in 
space and time, some of which overlap and some of 
which represent teleconnections. This dataset is then 
created so that it has the structure that linear compo- 
nent extraction techniques search for. We show that, 
even in the linear case, the PCA technique mixes the 
components, but that the ICA method performs a ro- 
tation to correctly separate the components. 

The goals of this paper are then to illustrate some 
of the problems of classical analyses, even in the sim- 
ple linear case, and to introduce a new component 
extraction technique that overcomes these problems, 
at least in its linear form (Section 5). We compare 
linear-ICA to PCA to measure the effect of the rota- 
tion transformation by the ICA algorithm. We apply 
these methods to a synthetic dataset, rather than to 
an actual geophysical dataset, so that the true an- 
swer is known independently and so that we can illus- 
trate several specific features of the component mix- 
ing problem. We do not use a priori information for 
the component extraction experiments since we are 
interested in exploratory techniques, not confirmatory 
ones, i.e. we wabt a technique to find the correct, but 
unknown components, not confirm results from other 
analysis. 
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2. The Linear Case and Classical 
Component Extraction Techniques 

A common approach for statistical component ex- 
traction is to require the decorrelation of the ex- 
tracted components, in which case the covariance ma- 
trix of extracted components < h l h > is constrained 
to be diagonal; but this decorrelation constraint has 
an infinity of solutions because 

j = e ( 7 ) 

where © is any undetermined Q x Q matrix so that 
0* © = IqxQ- Jo = L~ 1//2 • E l is a Q x N matrix 
with £ the truncated diagonal matrix of the higher 
(in decreasing order) eigenvalues of < x l • x > and 
E the N x Q matrix with the associated normalized 
eigenvectors in the columns. 

One particular decorrelation solution is the well- 
known Principal Component Analysis (PCA) or, in 
the geophysical community, Empirical Orthogonal 
Function 1 (EOF), first used in atmospheric sciences 
by Lorenz (1951). In this technique, an additional 
constraint is added to resolve the indeterminacy of the 
decorrelation solutions: the successive extracted com- 
ponents have to explain the maximum remaining vari- 
ance. This solution is given by taking © = Iq x q in 
Equation (7). Depending on which space the PCA is 
applied to (space, time, frequency, multivariate data, 
etc), the PCA has also been called Singular Spec- 
trum Analysis ( Broomhead and King 1986; Vautard et 
ai 1992), Multi-channel Singular Spectrum Analysis 
( Vautard et al. 1997), Extended Empirical Orthogo- 
nal Functions (Korns et al 2000), Multivariate Em- 
pirical Orthogonal Function (Xue et al. 2000), etc. 

Three well-known problems arise when using the 
PCA technique. (1) Even if the mixing of the compo- 
nents is linear as in Eq. (5), the maximum-explained- 
variance assumption can lead to a different mixing in 
the extracted components ( Kim and Wu 1999) as we 
will be shown here (see Figure la for an schematic 
illustration of this problem in a 2-dimensional case). 
(2) This mixing problem is also particularly serious 
when the PCA is applied to data that have more 
than one component with about the same variance. 
In this case, the problem is not solvable since any or- 
thogonal rotation of the principal components (i.e., 
in the space of the ‘‘degenerate” eigenvectors) will be 
a PCA solution (Figure lb). (3) Since PCA imposes 


^OF is a specific form of the general PCA were the ex- 
tracted basis functions are normalized. 


orthogonality on the extracted basis functions, mix- 
ing problems also arise when the actual physical ba- 
sis functions are not orthogonal (Figure lc). Another 
problem for the application of the PCA to geophysical 
data arises from an irregularly spaced grid of pixels 
that can lead to distorted basis functions (Karl et al 
1982). 

The PCA assumptions (linearity, orthogonality, 
maximum variance explained by each successive com- 
ponent) used to resolve the solution indeterminacy 
are not known, a priori, to be valid for a particular 
dataset. If these assumptions are not valid, variations 
that are not physically connected could be artificially 
mixed together into one extracted component (i.e., 
the mixing problem). This is the reason why PCA is 
often used in restricted geographical domains instead 
of global domains or applied to pre-filtered data to try 
to isolate a single dominant mode of variation, which 
PCA can correct!} identify. Thus, although PCA is 
useful for compressing information by describing the 
most variance with, the fewest terms in an expansion 
(as a dimension-re iuction/compression technique), it 
can lead to misinte/pretation of physical relationships 
when used as a component extraction technique. 

Rotational Techniques (RT) were introduced ( Horel 
1981; Rickman 1981), in part, to obtain a more phys- 
ically interpretable solution and to avoid some of the 
problems of PCA. in these approaches, an additional 
constraint of localisation, based on the so-called “sim- 
ple structure” principle, is used to solve the indeter- 
minacy of the decorrelation solutions. The rotation 
can be orthogonal (the rotation matrix is an orthog- 
onal matrix) or oblique (this constraint is relaxed). 
There exist many proposed localization criteria: quar- 
timax, varimax, transvarimax, quartimin, oblimax, 
etc (see the review paper of Rickman (1986) on this 
subject). Two general distinct classes of RT solution 
can be distinguished: confirmatory RT where a pri- 
ori information about the components is available and 
we want to verify the hypothesis, and exploratory RT 
where almost no a priori information on the prob- 
lem is available. We are interested, in this study, in 
the exploratory case where no a priori information 
is available. Since there is no general principle for 
choosing a particular localization criterion from this 
large set of proposed solutions is available, use if a 
particular RT method in exploratory mode may be 
equivalent to introducing a priori information about 
the localization that may not be well suited to the 
particular problem. 

We describe here the most commonly used criterion 
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for orthogonal rotation: 


y,(G) = E2>u) 4 - t?E [e^) 2 


( 8 ) 


i=i t=i 


j=l Lt=i 


where the constant 7 gives a family of rotations, with 
7 = 1.0 giving varimax rotations, and 7 = 0.0 giving 
quartimax rotations. The implementation of these 
techniques is not trivial, but automatic routines are 
available (see, for example, the routine G03BAF in 
the NAG Fortran Library Routine Document). 

Despite the proposed alternatives to the variance 
maximization assumption and the orthogonality con- 
straint used in various RT methods, they still all share 
two fundamental properties with PC A: they assume 
that the meaningful components are linearly mixed 
(classical techniques are intimately linked to the lin- 
ear assumtion and can not be generelized to nonlinear 
models) and that only second order statistics need be 
evaluated. 


3. The Independent Component 
Analysis technique 

In this section, we introduce the main concepts un- 
derlying the Independent Component Analysis (ICA) 
technique. For more details, the interested reader is 
referred to Bell et al. (1995) and Aires et al. (2000). 
The ICA technique aims to extract statistically inde- 
pendent components, a stronger constraint than the 
decorrelation requirement of the PC A. 

The statistical independence of two variables, h\ 
and /i 2 , is determined when their joint distribution 
can be factored: 

P(h u h 2 ) = P(hi) ■ P(h 2 ). (9) 

This constraint involves higher-order statistics whereas 
the decorrelation constraint only involves second-order 
statistics (i.e., mean and variance). Decorrelation 
is equivalent to statistical independence only in the 
case where the quantities are Gaussian distributed, 
so the higher-order statistics are particularly impor- 
tant when the analyzed data have components with 
non-Gaussian distributions ( Comon 1994). Avoiding 
the a priori assumption that second-order statistics 
are sufficient is important when the components are 
unknown as is usually the case. 

It is also important to distinguish the non-Gaussian 
character of the components, <7, with the non-Gaussian 
character of the data, x, itself in Eq. (5). If 


the data have a non-Gaussian distribution, then at 
least one component is also non-Gaussian, since for 
the simplest linear mixture of Gaussian components, 
the distribution would be Gaussian, but a non-linear 
combination of Gaussian distributions could be non- 
Gaussian. Some previous studies examine this non- 
Gaussian behavior in the data ( Burgers and Stephen- 
son 1999; Aires et al. 2000). 

A variable is characterized by all its statistical cu- 
mulants: the first cumulant is the mean, the sec- 
ond cumulant is the variance, the third cumulant 
is the skewness, the fourth cumulant is the kurto- 
sis, etc ( Press et al , 1992). For Gaussian variables, 
cumulants higher than 2 are zero. When data have 
zero-mean, the “skewness” skew(A r ) = an d 

the “kurtosis” kurt(A') = - 3. These cumu- 

lants are often used to test a departure from Gaus- 
sian behavior. The skewness measures the symme- 
try of the probability distribution function: when the 
skewness is positive, larger events are more proba- 
ble then smaller events, and the reverse is true when 
the skewness is negative. The kurtosis is a measure 
of the sharpness of the distribution: a negative kur- 
tosis indicates that the distribution has a broaderh 
central peak and larger tails than a Gaussian distribu- 
tion (sub-Gaussian), a positive kurtosis indicates that 
the distribution has a sharper central peak (super- 
Gaussian distribution). The non-Gaussian character 
of a variable is intimately linked to nonlinear dynam- 
ics ( Palmer 1999). For example, a nonlinear dynam- 
ical system with two atractors can result in binomial 
distributions. Thus, without a priori information on 
the Gaussianity of components in an analysis of geo- 
physical time series, the use of ICA is recommended 
since its requirement of statistical independence is 
more general than the decorrelation assumption. 

The time series observations are gathered into a 
dataset, Xf, of M observations, x(j ) = (x/ ; t = 
1 , . . .,7V), with j £ {1 , . . Af}, where M is the spa- 
tial dimension of the time series and N is its tempo- 
ral dimension. The times series, x(j), is assumed to 
be a mixture of statistically independent components 
cr = {a i ; i = 1 


x(j) = Q{cr{j)) 


( 10 ) 


where Q is an unknown mixture operator, which is, 
by hypothesis, non-singular (i.e. it can be inverted). 

The goal of ICA is to retrieve a function J : x -4 h % 
where h is an estimate of a and the terms {hi ; i — 
1 , . . . , Q } are statistically independent. The estimate, 
h y is defined as a deterministic function (linear or not) 
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of the observations: 


hi = Ji(W u x) ; i= 1 (11) 

where { W t ; i = 1, . . . , £?} is the set of parameters 
of J . As in RT, the number of components, Q , is 
here supposed to be known. This number can be es- 
timated, in easy cases, by a break in the frequency 
spectrum of the data; for more difficult spectra, see 
for example ( Joliffe 1986). With real observations, Q 
depends on the analysis objectives: extracting a lot 
of components allows for more complete description 
of the variability but makes the interpretation much 
more complicated, whereas extracting fewer compo- 
nents focuses attention on fewer different phenomena 
at the cost of explaining less of the variability. The 
reader interested in this tonic should refer to an arti- 
cle by Nadal et al (2000). 

The parameters, W x , are estimated by apply- 
ing a gradient descent algorithm to a cost func- 
tion that specifies the statistical independence of the 
{h t ; i = 1 ,. . . ,<2}. Different equivalent cost func- 
tions can be used; we use here the infomax approach 
to ICA ( Nadal and Parga 1994) from which simple 
algorithms have been derived ( Bell and Sejnowski 
1995). Information theory is used to specify the sta- 
tistical independence cost function: the fundamen- 
tal quantity is information redundancy. Given Q 
variables, h\ , h 2 , . . . , Hq , the information redundancy, 
7Z(hi , h 2 , . . . , hq), is defined as the Kullback diver- 
gence ( Dacunha-Castelle and Duflo 1982) between the 
joint distribution, P h (h \ , h 2) . . . , h Q ), and the factor- 
ized distribution, Pi(h\) ■ P 2 (h 2 ) • • • Pq(^q)' 


71(h) = 



dh{ Ph(h) log 


Ph{h) 

nf=, mi) 


(12) 


This information redundancy measures the differ- 
ence between the joint and the factorized distribu- 
tion: when the redundancy 7 i(h) = 0, Pu(h) = 
Yl?=i which means, by the definition in equa- 

tion (9), that the components of vector h are sta- 
tistically independent. An important remark is that, 
since no geometric constraint on the basis functions is 
specified in the redundancy quality criterion of (12), 
the base functions extracted by ICA, in contrast to 
PC A, can be non-orthogonal. 

A statistical regression model for the extraction 
model in Equation (11) has to be specified. For the 
nonlinear mixture case the regression model needs to 
be nonlinear in order to simulate Q~ l . The Multi- 
Layer Perceptron (MLP), an artificial Neural Network 


model, could be chosen for such a case. The nonlinear 
mixture case will be the subject of future work. 

In that present linear mixture case, we use a sim- 
pler MLP architecture with no hidden layers as the 
extraction model. This neural mapping is defined by 
(from right to left in Figure 2): 


y = f(h) = f(J • 2 ), (13) 

where / is the logistic function (nonlinear, bounded, 
and invertible). This model is basically a linear 
model, but it employes a nonlinear logistic function. 
A nonlinear / is used, even if the mixture model is lin- 
ear, for algorithmic considerations ( Nadal and Parga 
1994): to obtain statistical independence, the manip- 
ulation of higher moments (like < h t 3 >,< hi 4 > 
,< hi 5 >, ...) is required. Applying nonlinear /,■ 

to the /ij’s allows one to include these higher-order 
moments because the Taylor expansion of f(h) uses 
higher powers of the h{ values. So we consider a non- 
linear transformation (postfiltering step) on the esti- 
mator h = J • 2 : the extracted components aie not 
the output y or the neural mapping (13), but the 
vector h = J * 2 . 

The parameters, W ix defining the matrix ^ and 
the optimal transfer functions / have to be deter- 
mined by minimizing the redundancy criterion in 
(12). Practically, it has been demonstrated in vari- 
ous applications ( Bell and Sejnowski 1995) that full 
optimization of the transfer functions is not necessary 
for performing ICA. Although promising results have 
been obtained, this analysis strategy can be improved 
by introducing some partial adaptation of the trans- 
fer functions to that particular problem. We use here 
the classical sigmoide function f(x) = 1/(1 + e~ 0mx ) 
that has proven generally usefull. 

With the information redundancy reduction cri- 
terion, (12), and a no-hidden-layer architecture, a 
straightforward algorithmic implementation of the 
ICA has been found ( Bell and Sejnowski 1995) to es- 
timate the matrix J: 


Jik( n + 1) — Jik (n) + p(n)(J x k + y x • ^ J • hi) (14) 

/ 

where Jik(n) is an element of the matrix J at step n of 
the gradient descent, p is the learning rate parameter 
of the gradient descent, and 


Vi = 


d dyi 
dyi dh x 




(15) 


This algorithm is described in a more practical way 
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in the appendix 2 . Note that, although the theory 
behind this analysis method may seen complex, the 
actual computational procedure that, results for the 
linear case is relatively simple. 

ICA can be applied to the raw data, x(j ), but it 
has been shown ( Nadal et al. 2000; Aires et al. 2000) 
that a PCA pre-processing of observations makes the 
gradient descent step stabler and faster. The TV- 
dimensional data, x(j ), is projected onto its first N ' 
(where N' < N) principal components using the ma- 
trix Jo of PCA: observation noise is reduced and fewer 
free parameters need to be estimated because the di- 
mensions of the matrix J are considerably reduced, 
from Q x N to Q x N ' The dimension of the com- 
pression TV ' is chosen to be equal to Q, the number 
of extracted components, thus, J is a Q x Q matrix. 
The PCA-compression pre-processing step does not 
alter the components extracted by ICA if the neg- 
ligected components (including noise) are the statis- 
tically weak sources (see Nadal et al. 2000 for a def- 
inition of the term “weak”). In this configuration, 
the ICA »,an be considered equivalent to performing 
an oblique rotation of the PCA solution by generaliz- 
ing decor: elation to statistical independence, but no 
additional criterion must be selected from among a 
large numoer of possibilities like the localization con- 
straint in classical RT. The rotation matrix of the 
PCA solu ion J 0 is the ICA solution Q x Q matrix 
J. If the real physical components are Gaussian- 
distributed, then decorrelation is sufficient and the 
ICA algorithm would not change the PCA solution 
since zero-gradient is already reached. In the case 
where the components are non-Gaussian, the ICA 
rotates the initial PCA solution. So ICA improves 
the PCA solution only in the non-Gaussian case. We 
note that there are many examples in the literature 
showing non-Gaussian distributions of climate param- 
eters (e.g., cloud optical thickness ( Rossow and SchiJ - 
fer 1991), precipitation water path ( Lin and Rossow 
1996) abd SST ( Aires et al 2000)). 


2 See also the web page 

http://www.cnl.salk.edu/~tewon/ica_cnl.html of the Compu- 
tational Neuroscience Laborabory of Terry Sejnowski at The 
Salk Institute for links to recent literature, software and demos 
concerning the ICA paradigm. 


4. Application to a linear sum of 
components 

a. Construction of the synthetic dataset 

Geophysical time series have been analyzed by lin- 
ear statistical extraction techniques for decades. The 
synthetic dataset used in this study is generated to 
mimic the apparent expectations of such an analy- 
sis approach, namely, that the observations are a lin- 
ear sum of modes with very different space and time 
variations and, so, are separable by such an analysis. 
However, we also include two modes that are spatially 
overlapped, but with different time behaviors, and 
two spatially separated modes with the same time be- 
havior representing a teleconnection. We st lect Q — 6 
components representing six different dym meal phe- 
nomena, each described by a different temporal basis 
function, g i (solid lines in Figure 3), constructed from 
composites of sinusoids with different frequencies and 
phases. Each basis function has been normalized to 
give a temporal standard deviation of unity. The tem- 
poral dimension of these basis functions is taken to 
be N = 365 (e.g., one year of daily data). A spatial 
resolution of 2.5° x 2.5° is chosen, corresponding to 
M = 144 x 72 = 10368 pixels. Finally, the dataset, 
X/ = {x(j) e R N ; j = where R N is 

the space of real vectors of dimension TV, N = 365 
and M = 10368, is formed from the time series x(j) 
for each pixel j by the linear sum of the basis func- 
tions, x(j) = g x o\{j) + ■ + OqVqU) + e (linear 
model of Eq. 5). The term e is Gaussian-distributed 
noise (zero mean and standard-deviation of 0.5), rep- 
resenting very noisy data as might be the case when 
analysing climate anomalies. 

The {<Ti(j) ; i = indicate the strength 

of each component, z, at each pixel, j, i.e., the spa- 
tial distributions. These strengths are constructed to 
have a geographical Gaussian distribution, giving a 
different ellipsoidal distribution for each component 
(left column in Figure 4). Artificial land contours are 
introduced in the display of o x for easier description 
of the modes. One of the components has two peaks 
in its spatial distribution (near the Americas) to rep- 
resent a teleconnection pattern (map of component 1 
in left column of Figure 4), so the total number of 
ellipsoidal peaks is seven. Also, the geographical ex- 
tent of two of the components overlaps in the Indian 
Ocean (maps of components 4 and 6 in left column 
of Figure 4) to complicate the component extraction 
process. 

The variance contributed by the Q = 6 compo- 
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nents and the added noise are shown in Table 1: the 
components produce 67 % of the total variance and 
the noise produces 33 %. The total variance of a 
component results from the combination of the tem- 
poral variability of the basis function (as a function of 
normalized amplitude and frequency) and the spatial 
extent of the component . 

b. Results of PCA and ICA 

The PCA components are determined by comput- 
ing the matrix J 0 - The best number of PCA compo- 
nents to extract is determined here by observing the 
spectrum of cumulative percent of variance explained 
by the PCA components (Figure 5). More sophisti- 
cated criteria have been developed to determine the 
number of significant components (see for example 
( Joliffe 1986)). The first Q = 6 PCA components 
represent 67.7 % of the total variance and the 359 
remaining components explain equal portions of the 
remaining 32.3 % of the total variance, representing 
the noise in the dataset. The PCA temporal basis 
functions (crossed lines in Figure 3) are each com- 
pared with the real basis function to which it best 
corresponds. PCA basis functions 2, 3 and 4 provide 
a relatively good estimate of the true functions, al- 
though there are some errors near the peak values. 
PCA temporal basis functions 1, 5 and 6 (low fre- 
quencies) are much worse fits. In particular, higher 
frequencies have been mixed into the real basis func- 
tions. 

The corresponding PCA component maps are de- 
fined at pixel j by the values (o\ , . . . oq){j) = J 0 ’ X f 
and are shown in Figure 4 (middle column), where 
Xj* is the j ih column of data matrix X . We see that 
the PCA (or EOF) technique confuses elements from 
the different components, the general mixing problem, 
such that all of its components exhibit many more ge- 
ographic peaks than in the real components. Even if 
the corresponding PCA temporal basis function is rel- 
atively well retrieved, the corresponding component 
map still exhibits the mixing problem (see expeciallv 
PCA basis function 2 in Figure 3 and the correspond- 
ing PCA component map in Figure 4). One cause of 
the mixing is well illustrated in Table 1 where the vari- 
ance explained by each PCA component is compared 
to the variance of the actual components. The first 
PCA component explains 24.4 % of the total variance, 
which is much more than its true variance of 13.3 %. 
The 6 th PCA component represents only 3.1 % which 
is a considerable underestimate of the real value of 
10 %. Thus, the variance maximization constraint 


on the solution in PCA shifts signal from other com- 
ponents into the first component, producing a mix- 
ture of many true component variabilities. The noise 
level estimate of 32.3 % is a good estimate, but its 
small under-estimate of the real noise is due to the 
projection of noise into the first 6 PCA components 
(representing 0.7 %). 

Particularly notable in Figure 4 is that if not ro- 
tated the mixing tendency of the PCA could suggest 
many more teleconnections in observations than are 
actually present. Since in this synthetic case all six 
components contribute roughly the same amount of 
variance (10-13 %), the PCA technique has combined 
many of the actually-separate components into sev- 
eral of its components, trying to maximize the amount 
of variance explained by each. However, the method is 
then compelled to alternate positive and negative val- 
ues to compensate for having too much variance when 
the components are added back together. This effect 
is especially apparent for the overlapping components 
in the Indian Ocean: two PCA basis functions possess 
broad central peaks spanning the geographic distri- 
bution of both of the real components and two oth- 
ers possess, in this same location, two opposite-signed 
peaks (see PCA component maps of components 1,4, 
5 and 6 in Figure 4, middle column). The alternating 
positives and negatives in PCA are partly the result 
of the orthogonality constraint. A similar projection 
of real components into more than one PCA com- 
ponent occurs when a geographically isolated mode 
moves during the time period (Kim and Wu 1999). 
Moreover, the component with two peaks near the 
Americas, representing a real teleconnection, shows 
up in four of the PCA components (components 1, 3, 
4 and 6 in Figure 4, middle column), but mixed with 
other components as well, suggesting teleconnections 
between the Americas and the South Atlantic and In- 
dian Oceans that do not exist. 

ICA can be applied directly to the raw data, x(j ), 
but, as previously commented in Nadal et al (2000) 
and more briefly at the end of section (), a PCA 
pre-processing of observations makes the gradient de- 
scent step numerically stabler and faster. So the ob- 
served data, x(j), are first projected onto the first 
<2 = 6 PCA components using the matrix J 0 . The 
ICA technique is then applied to the pre-processed 
data, x(j) = J o- x(j) (dimension <2 = 6 instead of 
N = 365). As explained in section (), this is equiva- 
lent to performing a rotation on the PCA initial solu- 
tion where the rotation matrix is the <2 x Q-matrix J 
of the ICA solution. Thus the 6 ICA extracted com- 
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ponents explain the same amount of variance as the 
6 PCA components (67.7 %). 

A varimax rotation of the PCA solution (a RT 
method) has also been obtained for this case. Re- 
sults (not shown) are improved for some of the com- 
ponents since the real components are geographically 
well localized, but the mixing problem still remains. 

The six 1CA basis functions are shown in Figure 
3 (dotted lines). The ICA basis functions are very 
similar to the real basis functions. This compari- 
son shows how the ICA technique has corrected its 
first guess (the PCA solution) to be closer to the true 
solution. The additional information obtained from 
the requirement for statistical independence is nicely 
illustrated: the ICA technique has transformed the 
PCA initial solution for a better retrieval of all six 
components. The ICA component maps are presented 
in Figure 4 (right column). Generally, the compo- 
nents are well-retrieved and separated even the tele- 
connection mode (ICA component 1 in Figure 4, right 
column) and the two overlapping modes in the Indian 
Ocean (ICA component 4 and 6 in Figure 4, right 
column). The transformation of the PCA component 
maps by ICA is always an improvement. 

An experiment was conducted with Tie same data 
but without the noise (not shown). 1 he ICA sepa- 
rates the original six modes almost perfectly and the 
ICA solution is very close to the real solution. This re- 
sult indicates that the presence of measurement noise 
in a dataset will produce a small amount of mode 
mixing even in the ICA solution; however the results 
shown here (small hints of other modes in ICA com- 
ponent 4 and 6 in Figure 4, right column) is produced 
by a situation where the signal to noise ratio is only 
about two. Although this situation may be relevant 
to climate studies, ICA can separate most of the noise 
into its own statistically independent mode. 

Table 1 shows that the variance explained by the 
ICA components is much closer to the real solution 
than the initial PCA components: the variance ex- 
plained by the first couple of modes decrease and that 
retained by the remaining modes increase. Differences 
between the true and ICA explained variance for each 
component are less than 0.6 %, where the discrepan- 
cies are the result of the projection of some part of 
the noise into the ICA components. 

5. Concluding remarks 

For extraction of physically meaningful modes from 
observations, where the characteristics of the system’s 


dynamics are not (well-) known, identifying statisti- 
cally independent variation modes seems to be a sen- 
sible alternative for the rotation of a first PCA (i.e. 
EOF) solution which is sensitive to the mixing prob- 
lem. Our simple example shows that in the most gen- 
eral, though still linear case (the most favorable con- 
dition for PCA), PCA will mix modes of comparable 
magnitude, generating spurious regional overlaps or 
teleconnections where none exist or distorting exist- 
ing overlaps or teleconnections. We have shown the 
potential of the ICA technique for separating a com- 
plex signal in a more meaningful way. The mixing 
problem inherent in the PCA technique and the ar- 
tifacts produced by the orthogonality and maximum- 
variance constraints of PCA are avoided when rotated 
by ICA. Moreover, the use of higher-order statistics 
by ICA to determine statistical independence assumes 
only the generalization of the decorrelation used in 
all classical approaches. Nevertheless, even statisti- 
cal independence does not guarantee that the modes 
produced by different physical processes will be sep- 
arated. 

ICA, by finding statistically independent modes, 
may provide a better way to explore the unknown dy- 
namics of a system. In the case of climate variations, 
where the components of the system are probably 
coupled (see for example, Salby and Callaghan 2000 
or Krishnamurthy and Goswami 2000), considering 
the modes to be as statistically distinct as possible, 
even with a linear-ICA, would provide “prototypical” 
components that might serve as a guide to further in- 
vestigation. As with the classical PCA technique or 
classical RT, this first (linear) ICA algorithm is not 
able to deal correctly with propagating components 
or components mixed nonlinearly. However, the ICA 
paradigm (statistical independence) may be a suffi- 
ciently powerful concept to be generalized using more 
advanced statistical models (e.g., more complicated 
neural networks) to treat nonlinear problems. This 
requires development of nonlinear solution algorithms 
and their testing for cases where the combination of 
modes is nonlinear, when components are physically 
linked, and for cases with propagating modes. 

Appendix: Principal steps of the 
algorithm 

We adopt here the linear model x — G c r, where 
x is the observation, G is the basis function ma- 
trix and (T is the vector of components to estimate. 
The goal of the statistical decomposition technique 
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is to estimate a matrix J = G~ l (the superscript 
-1 represents the pseudo-inverse if G is not square), 
the filter matrix, using only a dataset of observations 
{x e ; e = 1, . . , , £*}, where E is the number of sam- 
ples in the dataset. With thr matrix J applied to 
each observation x, the components a are estimated 
by a ~ h = J • x, and the basis function matrix G is 
estimated by the inverse matrix J ~ l . 

The principal steps of the time series analysis by 
the ICA technique are; 

• Optional pre-processing: The dataset X / = 
{x(j) € R n ; j = 1, . . . , M}, where t is the time in- 
dex and j is the space index (geographical locations), 
may require pre-processing: (1) spatial, temporal or 
spatio-temporal interpolation to fill in missing data, 
(2) filtering of data to suppress undesirable frequen- 
cies (noise effects), (3) de-trending to obtain station- 
ary data, and (4) removing the annual cycle to ex- 
amine interannual anomalies. None of these steps is 
required. 

• Chose the space for the decomposition: (1) in 

time, which is the approach we have adopted in our 
study: 

*(;) = 9\<*iU) + ... +9q°qU) + e, 

(2) in space, (3) in frequency, or (4) in a mixture of 
these spaces. The observations (a time series or a 
geographical field, ...) are denoted, in the follow- 
ing, by the d-dimensional vector x e and the dataset 
is {x e ; e = 1,. 

• Center the dataset: The observation mean < 
x e > is removed from the dataset: x e «- x e - < x e >. 
This step is necessary for statistical techniques where 
data arc supposed to have zero-mean like ICA. 

• Optional normalization. If the user wants to put 
the same statistical weight on each coordinate of the 
observation x € : then the dataset can be normalized 
by the standard-deviation vector x e 4- x e /e x . 

• Optional Eigen- vector decomposition: The co- 
variance (or correlation, in the case of normalized ob- 
servations) matrix < x l x > is estimated from the 
dataset. The eigenvalues A (diagonal matrix) and 
the eigen-vector matrix V of < x i • x > are then 
computed using a classical numerical routine. The 
number of PC A or ICA extracted components Q is 
chosen by observing the spectrum of eigenvalues. 

• Optional PCA solution: The PCA solution is 
computed to pre-process the data: 

- The d x Q PCA basis function matrix, Gpca , 
contains in its columns the first Q eigen-vectors 


of V (the columns of V represent time series in 
the time decomposition, and geographical field 
in the space decomposition, . . . ). 

- Since, by definition, V 1 = V 1 , the filter PCA 
matrix, J pc a, is equal to the transposed Q x d 
basis function matrix, Gpca- Then, the ex- 
tracted components, /i, that estimate the true 
components, <7, are the projection of the obser- 
vations, x, onto the filters: h = Jpca * x. 

- The first Q eigenvalues in A represent the vari- 
ability explained by each of the Q components. 

• ICA solution: 

1 Pre-whitening of dataset: The PCA solution is 
used as a pre-processing step: the observations 
x e are projected onto the PCA filters: x e «— 
J pc a ■ x € . The ICA algorithm is then applied 
into these <2-dimensional data. 

2 The ICA solution, Jjca , is initialized as the 
identity matrix Iq*q . This, associated to the 
previous whitening step, is equivalent to taking 
the PCA solution as first guess for ICA. 

3 For the minimization of the criterion specify- 
ing the statistical independence, a gradient de- 
scent algorithm is used. The classical gradient 
descent uses all the samples of the dataset to 
compute a mean A Jik in Equation (14). This 
algorithm is called the deterministic gradient 
descent. The major inconvenience of this algo- 
rithm is that it can be trapped in local minima. 
We use, in our applicaton, the stochastic gradi- 
ent descent algorithm that uses the gradient de- 
scent formula (14) iteratively in unique random 
samples of the dataset. The stochastic charac- 
ter of the optimization algorithm allows theoret- 
ically, and under some constraint not discussed 
here, for the optimization technique to reach the 
global minimum of the criterion instead of a lo- 
cal minimum (Dufio 1996). 

4 An observation x € is randomly chosen in the 
dataset. The propagation through the neural 
network (chosen model for the component ex- 
traction) is given by: y = f(h) = /(Jjca * x e ), 
where /(a) = [1 + exp(~(3 • a)]" 1 is the logistic 
function (/3 is a parameter controlling the slope 
of the logistic function, we take (3 = 2.0 as in 
(Bell and Sejnowski , 1995). The FORTRAN 
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routine of this process is: 
c - * propagation through the neural network 

do i = 1, d 

h(i) = Q.dO 
do k = 1, d 

h(i) = h(i) + JicA(i,k) * x e (k) 
enddo 

h(i) = h(i) + bia(i) 
y(i) = I.d0/(l.d0 + dexp(-/3 * h(t))) 
enddo 

where 6ia is the classical bias vector in a MLP 
neural network (not shown in the text for sim- 
plicity). We use, in this routine, double preci- 
sion variables to avoid numerical instabilities. 

5 The learning process is then defined as: 
c - - transitory quantities 

do j = 1 , d 

hhh(j) = O.dO 
do k = 1 , d 

hhh{j) = hhh(j) + JjcA{k,j)*h(k) 
enddo 
enddo 

c - - modification of weights 
do i = 1, d 

do j = 1, d 

JjCAihj) = JiCAiiJ) +param* 

& {Jica (i> j) + 0 * (1-dO — 2. dO * 2 /( 0 ) * 
hhh(j)) 

enddo 

bia(i) = 6ia(i) + 0 * (l.dO — 2.d0 * y{i)) 
enddo 

where par am is the learning parameter of the 
gradient descent optimization (we take param = 
0.0005). 

6 Stopping criterion: many criteria can be used 
to define when to stop the learning cycle. The 
simplest criterion is to determine a priori the 
number of learning steps. A better criterion is 
to determine when the difference between solu- 
tion J ica at time t and at time t + 1 falls below 
some threshold value. Another stopping crite- 
rion is to evaluate the statistical independence 


of the extracted components h: cumulants (i.e. 
additive higher-order moments) are a practical 
way to do that, but this approach is computa- 
tionally expensive. The learning algorithm re- 
turns to step 4 until the stopping criterion is 
reached. 

• Analysis of results: When the matrix J ica has 
been determined by ICA, the global ICA filters (tak- 
ing into account the PCA pre-processing) are defined 
by the Q x d matrix: Jglo = J ica ■ J pc a 

- The projection of data is used to estimate the 
components: h = J glo * £ e 

- The d x Q ICA basis function matrix Gqlo = 
Jglo~ 1 = Gpca ' J jca~ 1 is normalized to ob- 
tain normalized ICA basis functions, as in PCA 
approach. 

- Computation of explained variance of each of 
t le basis functions. 
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Figure 1. Illustration of the Problems encountered by PC A when observations, having dimension 2 (coordinates X 
and Y), come from two components defining ellipses El and E2. The line D represents the first PCA axis defining 
the first PCA component: A) mixing due to the maximum-explained-variance constraint, B) indeterminacy when 
two components have same variance, and C) mixing due to the non-orthogonality of components. 

Figure 2. The component extraction model: the perceptron architecture, where x is the observation, h is the 
extracted component vector and y is the network ouput. 

Figure 3. Temporal basis functions, g t : ACTUAL (solid lines), PCA estimates (crossed lines) and ICA estimates 
(dotted lines). 

Figure 4. The maps of the actual components, c>i (left column), of the PCA extracted components, hi (middle 
column), and of the ICA extracted components, h, (right column): components number 1-6 from the top to the 
bottom, component maps have been centered and normalized for comparison purposes. The continental outlines 
are artificial and used to make discussion of specific features easier. 

Figure 5. Ce nulative percent of explained variance by the PCA components. 


Table 1 . Variance explained by Noise, 
REAL, PCA and ICA components 

Component I REAL PCA ICA 

1 1373 24.4 12.7 

2 12.6 14.5 13.0 

3 10.7 10.7 11.3 

4 10.7 8.8 10.2 

5 10.7 6.2 10.3 

6 10.0 3.1 1 0.0 

Noise 33.0 32.3 32.3 
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